The data presented in this report are part of a study aimed to assess differential gene expression and methylation in vaping versus non-vaping youths in Pueblo and Denver, CO. Pulmonary function data were also obtained to better understand the impacts of vape use on pulmonary function. To assess differential gene expression and methylation, naso-epithelial swabs were obtained from each participating subject. Pulmonary function is assessed using PFTs (Pulmonary Function Tests) and Impulse Oscillometry (IOS).
This data set consists of samples taken from 51 people ages 12-17 yrs from Pueblo, Denver, and Aurora, CO.
All analyses performed using R version 4.2.1 (2022-06-23)
Vape status is self-reported based on three questions.
These three questions were used to construct a dichotomous variable which defines subjects who have vaped in the last 6 months. This variable will be referred to as Vape Status. One participant (SID = 111) reported that they had used a vaping device 5 out of the last 30 days, but did not respond to last_vape. That participant is recorded as Vaped in this analysis.
Biological sex has been derived from methylation data utilizing the getSex function from the R package minfi 1.43.0.
Subjects’ geographic location, city, was grouped into the new broader variable recruiting_center which encompasses the broader geographic region where they live.
Measures of lung function and IOS were visually inspected for normality using histograms.
Annotation: Ensembl annotation for GrCH 39 ver. 37
Removal of Unwanted Variance: edgeR 3.39.6 and RUVSeq
1.31.0
Differential Expression Analysis: DESeq2 1.37.5
Gene Filtering Parameters
This analysis will conduct a comparison of various gene-filtering
parameters presented in previous analyses and in the current literature
to select parameters best-suited for this study.
Normalization
The following analyses used the function RUVr from the R package RUVSeq.
RUVr uses the deviance residuals from a first pass negative binomial GLM
to perform a factor analysis which corrects for unwanted technical
effects. The first-pass model formula is presented below :
\[raw \ read \ count \sim \beta_0 + \beta_1 * vape \ status \ + \beta_2 *male \ + \beta_3 * age\]
RUVr will be performed with k = 1 and k = 2 factors and the best k for factor analysis will be determined visually using an elbow plot, RLE plots, and dendrograms. Previous analyses used R package DESeq2 to fit the first pass GLM. This analysis will use edgeR due to its reference in the literature for the RUVr procedure mentioned above.
After removing participants with missing values for vape status, we are left with n = 50 subjects. Previous analyses showed n = 12 participants had vaped in the last 6 months. This analysis will use n = 13 participants who had vaped in the last 6 months. The lung function variable fev1 and fev1_fvc reported 22 missing values. IOS measures r5 and x20 reported 1 and 6 missing values, respectively.
#create table 1
vape_6mo_table1 <- tab1_dat %>%
tableby(vape_6mo_lab ~ sex_lab + age + recruitment_center +
latino_lab + fev1 + fev1_fvc +
r5 + x20, data = ., digits = 1, test = FALSE )
#Fix Labels
arsenal::labels(vape_6mo_table1) <- c(vape_6mo_lab = "Vaped in last 6 months",
age = "Age (yrs)", sex_lab = "Sex",
recruitment_center = "Recruitment Center",
latino_lab = "Ethnicity", fev1 = 'FEV1',
fev1_fvc = "FEV1/FVC (%)", r5 = 'R5', x20 = 'X20')
p_male <- (sum(tab1_dat$sex_lab == 'Male')/length(tab1_dat$sex_lab))*100
#Print Tables
summary(vape_6mo_table1, pfootnote = T)| Did Not Vape in Last 6 Months (N=37) | Vaped in Last 6 Months (N=13) | Total (N=50) | |
|---|---|---|---|
| Sex | |||
| Female | 17 (45.9%) | 8 (61.5%) | 25 (50.0%) |
| Male | 20 (54.1%) | 5 (38.5%) | 25 (50.0%) |
| Age (yrs) | |||
| Mean (SD) | 14.6 (1.4) | 14.8 (1.4) | 14.6 (1.4) |
| Range | 12.0 - 17.0 | 13.0 - 17.0 | 12.0 - 17.0 |
| Recruitment Center | |||
| Aurora | 15 (40.5%) | 0 (0.0%) | 15 (30.0%) |
| CommCity/Denver | 13 (35.1%) | 1 (7.7%) | 14 (28.0%) |
| Pueblo | 9 (24.3%) | 12 (92.3%) | 21 (42.0%) |
| Ethnicity | |||
| LatinX | 23 (62.2%) | 11 (84.6%) | 34 (68.0%) |
| Non-LatinX | 14 (37.8%) | 2 (15.4%) | 16 (32.0%) |
| FEV1 | |||
| N-Miss | 10 | 12 | 22 |
| Mean (SD) | 2.6 (0.7) | 3.9 (NA) | 2.6 (0.7) |
| Range | 1.2 - 3.9 | 3.9 - 3.9 | 1.2 - 3.9 |
| FEV1/FVC (%) | |||
| N-Miss | 10 | 12 | 22 |
| Mean (SD) | 0.8 (0.1) | 0.7 (NA) | 0.8 (0.1) |
| Range | 0.5 - 1.0 | 0.7 - 0.7 | 0.5 - 1.0 |
| R5 | |||
| N-Miss | 1 | 0 | 1 |
| Mean (SD) | 4.0 (0.9) | 5.0 (1.3) | 4.3 (1.1) |
| Range | 2.0 - 6.1 | 3.7 - 7.6 | 2.0 - 7.6 |
| X20 | |||
| N-Miss | 4 | 2 | 6 |
| Mean (SD) | 0.1 (0.6) | 0.7 (0.9) | 0.2 (0.7) |
| Range | -1.1 - 2.4 | -1.0 - 2.3 | -1.1 - 2.4 |
#Fix sex Bar Plot
sex_plot_trial <- tab1_dat %>%
group_by(sex_lab, vape_6mo_lab) %>%
summarise(N = n())
#By Recruitment sex (n = 50)
bar_sex <- sex_plot_trial %>%
ggplot(aes(x = sex_lab, y = N, fill = vape_6mo_lab)) +
geom_bar(stat = 'identity', position = 'dodge') +
labs(y = "Count", fill = "Vape Status") +
ggtitle("Sex") +
scale_fill_manual(values = brewer.pal(3, "Set2"))+
theme(axis.title.x=element_blank())
#By Latino(n = 50)
bar_latino <- tab1_dat %>%
ggplot(aes(x = latino_lab, fill = vape_6mo_lab))+
geom_bar(position = "dodge")+
labs( y = "Count", fill = "Vape Status") +
ggtitle("LatinX") +
scale_fill_manual(values = brewer.pal(3, "Set2")) +
theme(axis.title.x=element_blank())
#Fix Recruitment Center Bar Plot
center_plot_trial <- tab1_dat %>%
group_by(recruitment_center, vape_6mo_lab) %>%
summarise(N = n())
center_plot_trial[nrow(center_plot_trial) + 1,] <- NA
center_plot_trial[6,] <- list("Aurora", "Vaped in Last 6 Months", 0)
#By Recruitment Center
recruitment_center_hist <- center_plot_trial %>%
ggplot(aes(x = recruitment_center, y = N, fill = vape_6mo_lab)) +
geom_bar(stat = 'identity', position = 'dodge') +
labs(y = "Count", fill = "Vape Status") +
ggtitle("Center")+
scale_fill_manual(values = brewer.pal(3, "Set2")) +
theme(axis.title.x=element_blank())
#By FEV1/FVC Continuous
hist_fev1_fvc <- tab1_dat %>%
ggplot(aes(x = fev1_fvc, fill = vape_6mo_lab))+
geom_histogram(binwidth = 0.1, breaks = seq(0.5,1,0.1), position = "dodge")+
labs(x = TeX("\\frac{FEV1}{FVC}"), y = "Count", fill = "Vape Status:") +
xlim(0.5,1) +
scale_fill_manual(values = brewer.pal(3, "Set2")) +
ggtitle("FEV1/FVC")
#By FEV1/FVC Box
box_fev1_fvc <- tab1_dat %>%
ggplot(aes(x = vape_6mo_lab, y = fev1_fvc, fill = vape_6mo_lab)) +
geom_boxplot(show.legend = F) +
labs(x = "Vape Status", y = TeX("\\frac{FEV1}{FVC}")) +
scale_fill_manual(values = brewer.pal(3, "Set2")) +
ggtitle(TeX("\\frac{FEV1}{FVC} \ by Vape Status (n = 28)"))
#By R5(hist)
r5_hist <- tab1_dat %>%
ggplot(aes(x = r5, fill = vape_6mo_lab)) +
geom_histogram(binwidth = 1, breaks = seq(1,8,1), position = 'dodge') +
labs(x = "R5", y = "Count", fill = "Vape Status") +
xlim(1,8) +
scale_fill_manual(values = brewer.pal(3, "Set2")) +
ggtitle("R5")
#by R5(box)
r5_box <- tab1_dat %>%
ggplot(aes(x = vape_6mo_lab, y = r5, fill = vape_6mo_lab)) +
geom_boxplot(show.legend = F) +
labs(x = "Vape Status", y = "R5") +
ggtitle("R5 by Vape Status (n = 49)") +
scale_fill_manual(values = brewer.pal(3, "Set2")) +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank())
#by X20(box)
x20_box <- tab1_dat %>%
ggplot(aes(x = vape_6mo_lab, y = x20, fill = vape_6mo_lab)) +
geom_boxplot(show.legend = F) +
labs(x = "Vape Status", y = "X20") +
ggtitle("X20 by Vape Status (n = 44)") +
scale_fill_manual(values = brewer.pal(3, "Set2")) +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank())
#by x20(his)
x20_hist <- tab1_dat %>%
ggplot(aes(x = x20, fill = vape_6mo_lab)) +
geom_histogram(bins = 10, breaks = seq(-2,3,1), position = "dodge") +
labs(x = 'X20', y = "Count", fill = 'Vape Status') +
scale_fill_manual(values = brewer.pal(3, "Set2")) +
ggtitle("X20")Figure 1 visualizes the demographic information presented in Table 1. Most vaping subjects were recruited in Pueblo (92%) and identified as LatinX (85%). 50% of subjects were male as confirmed from the available methylation data.
Figure 2 is a visualization of the pulmonary function (\(\frac{FEV1}{FVC}\)) and IOS (R5 and X20) variables. \(\frac{FEV1}{FVC}\) was only completed by n = 22 individuals from the study population. R5 and X20 represent n = 49 and n = 44 individuals, respectively.
#Respiratory Figures
ggarrange(hist_fev1_fvc, ggarrange(r5_box, x20_box, ncol = 2, nrow = 1, legend = "none"), nrow = 2, common.legend = T, legend = "bottom")Figure 3 identifies correlation patterns in the clinical variables of
interest. To test for correlation, a contingency table was made for each
comparison of variables. Because some contingency tables contained cells
with <5 subjects, a Fisher’s Exact test was used for all comparisons.
From the correlation matrix, it appears that there may be significant correlation between vape status and recruitment center. A separate t-test was run to determine independence between vape status and age.
| Test Variable | Group 1 | Group 2 | Mean Group 1 | Mean Group 2 | t | df | p-value |
|---|---|---|---|---|---|---|---|
| Age | Did Not Vape in Last 6 Months | Vaped in last 6 months | 14.6 | 14.9 | -0.7 | 16.6 | 0.5 |
After merging clinical and gene-count data, there were n = 47 samples present. From the clinical metadata, SID = 105, 137, and 103 are missing genetic data. Sample 23 (SID = 144) was also removed due to missing vape status.
The table below compares gene-filtering parameters from previous analyses to a parameter presented by the creators of the RUVseq package.
source(here("Code/02_gene_filter_comparison.R"), local = knitr::knit_global())
filter_compar| Filter | Incluison Criteria | Gene Count Before | Gene Count After | Genes Removed |
|---|---|---|---|---|
| 1 | At least 25% of the samples have > 0 reads | 60,651 | 31,505 | 29,146 |
| 2 | The range of reads across all samples < 100 | 31,505 | 16,860 | 14,645 |
| 3 | >5 reads in at least 2 samples (Bioconductor) | 60,651 | 29,141 | 31,510 |
The figure below is used to visually assess how many genes are removed across cutoff values for the range of reads for each gene across the 47 samples after filtering out genes that have 0 read counts in 75% or more of the samples. The red point represents the number of genes (14,645) removed at the cutoff range of 100 (the value used in previous analyses).
After reviewing filter comparisons, it appears that filtering parameters presented by the creators of RUVSeq may be overly conservative for this application. Analyses will proceed using the same filters as previous analyses (filters 1 & 2 from table 2). The first filter will remove genes with 0 reads in more than 75% of samples, and the second will remove genes with low variation (range of reads < 100) that may be considered “house-keeping” genes. This analysis will include a total of n = 16,860 genes.
The following figure shows the Relative Log Expression (RLE) and Principal Component Analysis (PCA) of read counts for each sample without any prior transformations.
It appears that Sample 12 (SID = 102) may be an outlier. Below are the RLE and PCA plots with the sample removed.
To further assess the removal of sample 12, the following is an Elbow Plot which displays the reduction in within-cluster sums of squares when increasing the number of clusters (k) both with and without sample 12.
To visually inspect for the best cutoff for factor analysis, RUVr was run for k = 1 and k = 2 factors both with and without Sample 12 included in the dataset. The RLE and PCA plots for each are presented below.
As was discussed previously, it appears that Sample 12 may be an
outlier. Below are the RLE plots after RUVr was conducted with k = 1 and
k = 2 with Sample 12 removed from the dataset.
To help additionally compare, the following are side-by-side PCA plots
comparing samples with and without sample 12 removed before RUVr, after
RUVr with k = 1, and after RUVr with k = 2.
The dendrograms for k = 1 and k = 2 are compared below with and without Sample 12 included.
Using the figures above (with special attention towards Figure 10), It seems appropriate to keep sample 12 and use k = 2 factors for RUVr normalization.
After review of the demographic composition of the participants in this study and the distribution of read counts across subjects, there are two key takeaways. The first is that there is a class imbalance when stratifying the study population by both vape status and recruitment center. Almost all vaping subjects were recruited from the Pueblo center, and 0 were recruited from the Aurora center. The implication of this class imbalance is that results may not generalize well to other populations. In order to assess the presence and direction of potential bias, a sensitivity analysis will be conducted for the recruitment center variable in which we look at the results for Pueblo subjects alone.
The other key takeaway is that Sample 12 is of particular concern as an outlier with regard to sample quality and read count. The inclusion of k = 2 RUVr factors were able to account for the outlying nature of this sample, but a separate sensitivity analysis will still be conducted as a gut-check to ensure that Sample 12 does not significantly bias our model.
Between this report and it’s previous version (“teen_vape_exploratory_report_2022_04_15”), the variable LatinX was included in the model as opposed to the variable Age. This report corrects that mistake. Visually, it seems to make little differences in the figures used to select K.